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Abstract — This paper describes performance bounds for 
compressed sensing (CS) where the underlying sparse or 
compressible (sparsely approximable) signal is a vector of 
nonnegative intensities whose measurements are corrupted 
by Poisson noise. In this setting, standard CS techniques 
cannot be applied directly for several reasons. First, the usual 
signal-independent and/or bounded noise models do not apply 
to Poisson noise, which is non-additive and signal-dependent. 
Second, the CS matrices typically considered are not feasible in 
real optical systems because they do not adhere to important 
constraints, such as nonnegativity and photon flux preservation. 
Third, the typical minimization leads to overfltting in the 

high-intensity regions and oversmoothing in the low-intensity 
areas. In this paper, we describe how a feasible positivity- and 
flux-preserving sensing matrix can be constructed, and then 
analyze the performance of a CS reconstruction approach for 
Poisson data that minimizes an objective function consisting of 
a negative Poisson log likelihood term and a penalty term which 
measures signal sparsity. We show that, as the overall intensity 
of the underlying signal increases, an upper bound on the 
reconstruction error decays at an appropriate rate (depending 
on the compressibility of the signal), but that for a fixed signal 
intensity, the signal-dependent part of the error bound actually 
grows with the number of measurements or sensors. This 
surprising fact is both proved theoretically and justified based 
on physical intuition. 

Keywords: complexity regularization, nonpar ametric estima- 
tion, sparsity, photon-limited imaging, compressive sampling 



I. Introduction 

The basic idea of compressed sensing (CS) is that, when 
the signal of interest is very sparse (i.e., zero-valued at most 
locations) or highly compressible in some basis, relatively 
few "incoherent" observations are sufficient to reconstruct 
the most significant non-zero signal components [2j. 
Despite the promise of this theory for many applications, 
very little is known about its applicability to photon-limited 
imaging systems, where high-quality photomultiplier tubes 
are expensive and physically large, limiting the number of 
observations that can reasonably be acquired by the system. 
Limited photon counts arise in a wide variety of applications. 
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including infrared imaging, nuclear medicine, astronomy and 
night vision, where the number of photons detected is very 
small relative to the number of pixels, voxels, or other entities 
to be estimated. Computational optics techniques, compressed 
sensing principles, and robust reconstruction methods can 
potentially lead to many novel imaging systems designed to 
make the best possible use of the small number of detected 
photons while reducing the size and cost of the detector array. 
Recent work has empirically explored CS in the context of 
photon limited measurements |3,|-|j6J, but theoretical perfor- 
mance bounds similar to those widely cited in the conventional 
CS context previously remained elusive. 

This is in part because the standard assumption of signal- 
independent and/or bounded noise (cf. |j7|, (8)) is violated 
under the Poisson noise models used to describe images 
acquired by photon-counting devices |^. The Poisson obser- 
vation model 

2/~Poisson(Ar), (1) 

where /* G is the signal or image of interest, A e K^^™ 
linearly projects the scene onto an A^-dimensional space of 
observations, and y E is a length- vector of observed 
Poisson counts, stipulates that the likelihood of observing a 
particular vector of counts y is given by 

where {Af*)j is the j*^ component of Af*. Moreover, in 
order to correspond to a physically realizable linear optical 
system, the measurement matrix A must be: 

• Positivity-preserving — for any nonnegative input signal 
/, the projected signal Af must also be nonnegative. 
Using the standard notation / h to denote the non- 
negativity of /, we can write this condition as 

/ ^ =^ AfhO. 

• Flux-preserving — for any input signal f h 0, the mean 
total intensity of the observed signal A f must not exceed 
the total intensity incident upon the system: 

N m 
i=l i=l 

A. Surprising main result 

In this paper, we make the following contributions: 
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• design physically realizable sensing matrices. A, which 
incorporate the above positivity and photon flux preser- 
vation constraints; 

• propose a penalized-likelihood objective function for re- 
constructing /* from y observed according to ([T]i; 

• derive upjper bounds on the error between /* and the 
estimate / and demonstrate how the error scales with the 
overall intensity (/ = J2i fD^ 'he size of /* (to), the 
number of measurements (N), and the compressibility of 
the signal in some basis (a); and 

• present empirical results demonstrating the efficacy of the 
proposed method. 

In particular, the main theoretical result presented in this paper 
shows that, for an a-compressible signal of total intensity ^ 



reconstruction error cx N 



log TO 
/ 



log(TO/A^) 

N 



for N sufficiently large. (As we show in Section III-B there is 
a threshold effect in that the number of measurements N must 
be large enough to guarantee that the per-sensor reconstruction 
error decays with the incident signal intensity /.) Since the 
total number of observed events or photons, n = X^ili Vi^ 
is the realization of a Poisson random variable with intensity 
/, the bound reflects how error scales with the number of 
observed events. 

While the rate of error decay as a function of the total 
intensity, /, coincides with earlier results in denoising contexts, 
the proportionality of the intensity-dependent term in the 
error to N may seem surprising at first glance. However, 
one can intuitively understand this result from the following 
perspective. If we increase the number of measurements {N) 
while keeping the expected number of observed photons (/) 
constanj^ the number of photons per sensor will decrease, so 
the signal-to-noise ratio (SNR) at each sensor will likewise 
decrease, thereby degrading performance. Having the number 
of sensors exceed the number of observed photons is not 
necessarily detrimental in a denoising or direct measurement 
setting (i.e., where A is the identity matrix) because multiscale 
algorithms can adaptively bin the noisy measurements together 
in homogeneous regions to achieve higher SNR overall \\(^ , 
[ [TT) . However, in the CS setting the signal is first altered by 
the compressive projections in the sensing matrix A, and the 
raw measurements cannot themselves be binned to improve 
SNR. In particular, there is no natural way to aggregate 
measurements across multiple sensors because the aggregation 
effectively changes the sensing matrix in a way that does not 
preserve critical properties of A. 

One might also be surprised by this main result because 
in the case where the number of observed photons is very 
large (so that SNR is quite high and not a limiting factor), our 

'More precisely, / refers to the total intensity integrated over the exposure 
time, so that increasing / can be associated with more source intensity, longer 
exposure time per measurement, or both. 

-In some systems, such as a single-detector system, more measurements 
might seem to suggest more observed photons. However, holding / fixed 
while increasing implies that each measurement is collected over a shorter 
exposure. Thus increasing A'^ does not correspond to an increase in the number 
of observed events/photons. 



bounds do not converge to the standard performance bounds 
in CS. This is because our bounds pertain to a sensing matrix 
A which, unlike conventional CS matrices based on i.i.d. 
realizations of a zero-mean random variable, is designed to 
correspond to a feasible physical system. In particular, every 
element of A must be nonnegative and appropriately scaled, so 
that the observed photon intensity is no greater than the photon 
intensity incident on the system (i.e., we cannot measure more 
light than is available). This rescaling dramatically impacts 
important elements of any performance bounds, including the 
form of the restricted isometry property 1 12 1, p3j , even in the 
case of Gaussian or bounded noise. (Additional details and 



interpretation are provided in Section II-B after we introduce 
necessary concepts and notation.) 

As a result, incorporating these real-world constraints into 
our measurement model has a significant and adverse impact 
on the expected performance of an optical CS system. 

B. Relation to previous CS performance bounds 

The majority of the compressed sensing literature assumes 
that there exists a "sparsifying" reference basis W, so that 
9* = f* is sparse or lies in a weak-^p space. When the 
matrix product AW obeys the so-called restricted isometry 
property (RIP) p2| , |jl3| or some related criterion, and when 
the noise is bounded or Gaussian, then 6* can be accurately 
estimated from y by solving the following £2 — ^1 optimization 
problem (or some variant thereof): 



9 = argmin [\\y - AW9\\l + r||0||i] 



(2) 



where r > is a regularization parameter Q, 1 13 1, 1 14 1. 

However, the £2 data-fitting term, \\y — AWOW^, is prob- 
lematic in the presence of Poisson noise. Because under 
the Poisson model the variance of the noisy observations is 
proportional to the signal intensity, £2 data-fitting terms can 
lead to significant overfitting in high-intensity regions and 
oversmoothing in low-intensity regions. Furthermore, photon- 
limited imaging systems impose hard constraints on the nature 
of the measurements that can be collected, such as non- 
negativity, which are not considered in much of the exist- 
ing compressed sensing literature (recent papers of Dai and 
Milenkovic |15| and of Khajehnejad et al. |16| are notable 
exceptions). Bunea, Tsybakov and Wegkamp 1 17 1 study the re- 
lated problem of using li regularization for probability density 
estimation, but rather than assuming incoherent measurements 
of a random variable (similar to our setup), they assume direct 
observations of a random variable and learn, for example, 
a sparse mixture model. Furthermore, their work assumes 
infinite precision in the observed realizations of the random 
variable, so that their analysis does not provide any insight 
into how the number or bit depth of detector elements impacts 
performance. More recent work by Rish and Grabamik | [T8| 
explores methods for CS reconstruction in the presence of 
exponential family noise using generalized linear models, but 
does not account for the physical constraints (such as flux 
preservation) associated with a typical Poisson setup. 

In this paper, we propose estimating /* from y using a 
regularized Poisson log-likelihood objective function as an 
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alternative to (|2]), and we present risk bounds for recovery of 
a compressible signal from Poisson observations. Specifically, 
in the Poisson noise setting we maximize the log-likelihood 
while minimizing a penalty function that, for instance, could 
measure the sparsity of 6 — f: 



f — arg min 



N 

E 



[(A/),-y,log(A/),]+rpcn(/) 



subject to / ^ 0, ft= I 



(3) 



where pen(-) is a penalty function that will be detailed later, 
and / is the known total intensity of the unknown /*. 

C. Organization of the paper 

Section [U] contains the problem formulation, describes the 
proposed approach, and details the construction and prop- 
erties of a feasible sensing matrix A. 



Ill 



In Section 

develop an oracle inequality for the proposed estimator and 
then use it to establish risk bounds for compressible signals. 



Section IV contains a proof-of-concept experiment based on 
recent breakthroughs in sparse reconstruction methods we 
initially proposed in |4|. For the sake of readabiUty, proofs 
of all theorems are relegated to the appendices. 

II. Problem formulation and proposed approach 

We have a signal or image /* ^ of size m that we wish 
to estimate using a detector array of size N ^ m. We assume 
that the total intensity of /*, given by / = = 2™ ^ /*, 

is known a priori. We make Poisson observations of Af*, 
y ~ Poisson(^/*), where A G M^^™ is a positivity- and 
flux-preserving sensing matrix. Our goal is to estimate /* e 
M'P from y eZ^. 

A. Construction and properties of the sensing matrix 

We consider sensing matrices composed of zeros and 
(scaled) ones, where p is the probability of having a zero and 
1 — p is the probability of having a one. In the context of 
optical systems, small p corresponds to sensing matrices with 
many ones, which allow most of the available light through 
the system to the detectors. Conversely, large p corresponds to 
having each measurement being the sum of a relatively small 
number of elements in the signal of interest. To explore the 
tradeoff between these two regimes, we explicitly consider p 
throughout our analysis. 

We construct our sensing matrix A as follows. Given some 
p € (0,1), let Vp denote the probability distribution of a 
random variable that takes values 



a; 



\+ — 



1-p 



l-p 



with probability p\ 
with probability 1 — p. 



Note that Vi /2 is the usual Rademacher distribution which puts 
equal mass on —1 and on +1. Let Z be an iV x m matrix 
whose entries Zi,j are drawn i.i.d. from Vp. We observe that 



for all 1 < z, fc < and 1 < j, ^ < m. Most compressed 
sensing approaches would proceed bj assuming that we make 
(potentially noisy) observations of Af* , where A = Zl\fN . 
However, A will, with high probability, have at least one nega- 
tive entry, which will render this observation model physically^ 
unrealizable in photon-counting systems. Therefore, we use A 
to generate a feasible sensing matrix A as follows. Let \ry.s 
denote the r x s matrix all of whose entries are equal to \. 
Then we take 



A - Y ^ ^+ iv 

We can immediately deduce the following properties of A: 

• It is positivity-preserving because each of its entries is 
either or 1 /iV. 

• It is flux-preserving, i.e., for any / G M™ we have 

P/lli < IL/lli- (5) 
This is easy to see directly: if / ^ 0, then Aj > 0, and 

N in 771 

i=l J=l J=l 

• With probability at least 1 — A^p™ (w.rt. the realization 
of {Zij}), every row of A has at least one nonzero entry. 
Assume that this event holds. Let / G M™ be an arbitrary 
vector of intensities satisfying / )^ (c/)lmxi for some 
c> 0. Then 

Afh'^lNxi- (6) 



To see this, write 

m ^ 



cl 
N' 



Furthermore, and most importantly, with high probability A 
acts near-isometrically on certain subsets of M™. The usual 
formulation of this phenomenon is known in the compressed 
sensing literature as the restricted isometry property (RIP) 
| fT2l , fTS I, where the subset of interest consists of all vectors 
with a given sparsity. However, as shown recently by Mendel- 
son et al. p9) , pO) , the RIP is, in fact, a special case of 
a much broader circle of results concerning the behavior of 
random matrices whose entries are drawn from a subgaussian 
isotropic ensemble. These terms are defined in Appendix |A] 
where we also prove the following result: 



Theorem 1 Consider the matrix A = Z /\/ N, where the 
entries of Z are drawn i.i.d. from Vp. Then there exist absolute 
constants Ci , C2 > such that the bound 



\u - v\\2 < A\\Au - Av\\2 + 



2 2clCl\og{c2Ctm/N) 



N 



(7) 



will hold simultaneously for all u,v G BY^ with probability at 



EZij- = 



and EZi^jZk^i = SikS^e 



(4) 



least 1 - e-'=i^/'^p, where B'^ ^ {u £ 
C = 



" 1 



1, ifp = 1/2. 



1} and 
(8) 
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Moreover, there exist absolute constants 03,04 > such that 
for any finite T C S""-\ where 5"^^ = {u G M™ : ||u||2 = 
1} is the unit sphere {£2), 



1/2 <\\Au\\l <3/2, yueT 



(9) 



holds with probability at least 1 — e '^^^/''p, provided N > 
c,C^\og\T\. 

B. DC offset and noise 

The intensity underlying our Poisson observations can be 
expressed as 



Af* 



p{l -p) 



Af* 



N ' N 

As described in Theorem [l] the idealized sensing matrix A 
has a RIP-Hke property which can lead to certain performance 
guarantees if we could measure Af* directly; in this sense, 
Af* is the informative component of each measurement. 
However, a constant DC offset proportional to / is added 
to each element of Af* before Poisson measurements are 
collected, and elements of Af* will be very small relative to 
/. Thus the intensity and variance of each measurement will 
be proportional to /, overwhelming the informative elements 
of Af* (To see this, note that yi can be approximated as 
{Af*)i + ^ {Af*)i^i, where is a Gaussian random variable 
with variance one.) 

As we will show in this paper, the Poisson noise variance 
associated with the DC offset, necessary to model feasible 
measurement systems, leads to very different performance 
guarantees than are typically reported in the CS literature. The 
necessity of a DC offset is certainly not unique to our choice of 
a Rademacher sensing matrix; it has been used in practice for a 



wide variety of linear optical CS architectures (cf. |21 1-|24|). 
A notable exception to the need for DC offsets is the expander- 
graph approach to generating non-negative sensing matrices, 
which has recently been applied in Poisson CS settings [25|; 
however, theoretical results there are limited to signals which 
are sparse in the canonical (i.e. Dirac delta or pixel) basis. 

As a result, the framework and bounds established in this 
paper have significant and sobering implications for any linear 
optical CS architecture operating in low-light settings. 

C. Reconstruction approach and bounds 

We propose solving the following optimization problem: 



/ = arg min - \ogp{y\Af) + 2 pen(/) 
/er L 



(10) 



where pen(/) > is a penalty term. Here, F = T{m, I) 
is a countable set of feasible estimators / G M™ satisfying 
J2iLi fi = the penalty function satisfies the Kraft 

inequality: 

^e-P™(^)<l. (11) 
/er 

'We would like to thank Emmanuel Candes and an anonymous reviewer 
for helpful insights on this point. 



(In Q and r is a free parameter that could be selected by 
the user, while in (10 1 it is fixed at 2 for a penalty function 
that satisfies the Kraft inequality. In practice one often prefers 
to use a value of r different from what is supported in theory 
because of slack in the bounds.) While the penalty term may 
be chosen to be smaller for sparser solutions 9 — W'^ f, 
where W is an orthogonal matrix that represents / in its 



"sparsifying" basis, our main result only assumes that ( 1 1 



is satisfied. In fact, a variety of penalization techniques can be 
used in this framework; see pO), |26|, p7) for examples and 



discussions relating Kraft-compliant penalties to prefix codes 
for estimators. Many penalization or regularization methods in 
the literature, if scaled appropriately, can be considered prefix 



codelengths. We can think of (10 1 as a discretized-feasibility 
version of ([3]), where we optimize over a countable set of 
feasible vectors that grows in a controlled way with signal 
length TO. 

We will bound the accuracy with which we can estimate 
/*//; in other words, we focus on accurately estimating the 
distribution of intensity in /* independent of any scaling 
factor proportional to the total intensity of the scene, which 
is typically of primary importance to practitioners. Since 
the total number of observed events, n, obeys a Poisson 
distribution with mean /, estimating / by n is the strategy 
employed by most methods. However, the variance of this 
estimate is /, which means that, as / increases, our ability to 
estimate the distribution improves, while accurately estimating 
the unnormalized intensity is more challenging. We chose to 
assume / is known to discount this effect. The quality of a 
candidate estimator / will be measured in terms of the risk 

Rirj) = j^\\r-f\\l 

D. Summary of notation 

Before proceeding to state and prove risk bounds for the 
proposed estimator, we summarize for the reader's conve- 
nience the principal notation used in the sequel: 

• to: dimension of the original signal 

• to): number of measurements (detectors) 

• /* G M™: unknown nonnegative-valued signal 

• I — /*: total intensity of /*, assumed known 

• Z £ M^^™: random matrix with i.i.d. entries taking 
values — \/(l — p)/p with probability p and \Jp/{l — p) 
with probability 1 — p, where p G (0, 1) is a tunable 
parameter 

» A = Z/\/N : scaled matrix Z (cf. Theorem[r|for its norm 
preservation properties) 

• (p: subgaussianity constant of A, defined in (|8]l 

• ci, C2, C3, C4: absolute constants from Theorem [T] 

. A = y/p{l^p)/NA + iV- V(l - P)livxm: physically 
realizable sensing matrix, obtained by shifting and scaling 
of A; satisfies positivity and flux preservation require- 
ments 



r C 



finite or countably infinite set of candidate 



estimators with a penalty function pen : F M_|- 
satisfying the Kraft inequality ([TTJ 
Rif*J) = II/* - fWl/I^- the risk of a candidate 
estimator / 
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• /: the penalized maximum-likelihood estimator taking 
values in F, given by the solution to ([T0| 
Other notation will be defined as needed in the appropriate 
sections. 

III. Bounds on the expected risk 

Now we are in a position to establish risk bounds for the 
proposed estimator ( [TO| i. Theorem |2] in Section |ni-A| is a 
general risk bound that holds (with high probability w.rt. the 
realization of A) for any sufficiently regular class of candidate 
estimators and a suitable penalty functional satisfying the Kraft 
inequality. Section 111-B then particularizes Theorem [2] to the 
case in which the unknown signal /* is compressible in some 
known reference basis, and the penalty is proportional to the 
sparsity of a candidate estimator in the reference basis. 

A. An oracle inequality for the expected risk 

In this section we give an upper bound on the expected risk 
ER{f*,f) that holds for any target signal /* ^ satisfying 
the normalization constraint Y^^iLi ft — without assuming 
anything about the sparsity properties of /*. Conceptually, 
our bound is an oracle inequality, which states that the 
expected risk of our estimator is within a constant factor of 
the best regularized risk attainable by estimators in F with 
full knowledge of the underlying signal /*. More precisely, 
for each / G F define 

and for every T C F define 

i?*(r,T)^ mini?* (/*,/), 

i.e., the best penalized risk that can be attained over T by 
an oracle that has full knowledge of /*. We then have the 
following: 

Theorem 2 Suppose that, in addition to the conditions stated 
in Section the feasible set F also satisfies the condition 

Af h (c//iV)l^xi, V/ e F (12) 

for some < c < 1. Let Qn,p be the collection of all subsets 
T C F, such that |T| < 2^/'=*^. Then the following holds 



with probability at least 1 



me 



-KN 



for some positive K 



K{ci,c^,p) (with respect to the realization of A): 

2c2c4log(c2C>/iV) 



Ei?(r,/)<Cjv,p min i?*(r,T) 



N 



(13) 



where 



C 



N.p 



24 



16 



N 



c ' p(l -p)^ 

and the expectation is taken with respect to y 
Poisson(y4/*). 



Remark 1 One way to satisfy the positivity condition (12i is 
to construct F so that 



/ > (Cl)lr 



V/eF. 



(14) 



The desired condition ^V2\ then follows from (|6]l. A condition 
similar to ([T4]l is natural in the context of estimating vectors 
with nonnegative entries from count data, as it excludes the 
possibility of assigning zero intensity to an input of a detector 
when at least one photon has been counted p8). 



Remark 2 Both Cn.p and are minimized when p = 1/2, 
suggesting that altering the sensing architecture to have p ^ 
1 /2 may impair performance, despite the fact that increasing p 
would increase the expected total number of observed events 
(photons) and decreasing p would decrease the DC offset of 
the measurements and hence measurement noise variance. 

Remark 3 Observe that for any pair A^i < N2 we 
have the inclusion Qnx,p ^ Gn2,p^ which implies that 
minxeejv R*{f*,T) is a decreasing function of N. Hence, 
the first term on the right-hand side of ( 13 1 is the product 
of a quantity that increases with N (i.e., Cn,p) and one 
that decreases with N. Combined with the presence of the 
0{N^^ log{m/N)) additive term, this points to the possibility 
of a threshold effect, i.e., the existence of a critical number of 
measurements TV*, below which the expected risk may not 
monotonically decrease with N or /. 



B. Risk bounds for compressible signals 

We now use Theorem [2] to analyze the performance of the 
proposed estimator when the target signal /* is compressible 
(i.e., admits a sparse approximation) in some orthonormal 
reference basis. 

Following fT\, we assume that there exists an orthonormal 
basis $ = {(f)i, . . . , (pm} of M™, such that /* is compressible 
in $ in the following sense. Let W be the orthogonal matrix 
with columns (pi, ... , (j>„i- Then the vector 6* of the coeffi- 
cients 6* = {f*,(pj) of /* in $ is related to /* via /* = W9*. 
Let 



y^j^^ be the decreasing rearrangement of w . 

.'(1)1 ^ i'^(2)i — •■• ^ l^(m)l- assume that there exist 
some < g < 00 and p > 0, such that 



'(1)' 
> 



< Plj 



-1/9 



L 



(15) 



Note that for every 1 < j < to we have 



< 



= ii.rii2<iiriii = /, 



so we can take p to be a constant independent of / or to. 



Any 9* satisfying (15 1 is said to belong to the weak-iq ball 



of radius pi. The weak-fg condition (15i translates into the 
following approximation estimate |1|; given any 1 < fc < to, 
let 6'^*^^ denote the best fc-term approximation to 6*. Then 



1 

P 



\9* - e'^'^^wl < Cp^k- 



l/q-l/2 (16) 



for some constant C > that depends only on q. We also 



assume that /* satisfies the condition ( 14 1 for some c G (0,1) 
a lower bound on which is assumed known. 
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Theorem 3 There exist a finite set of candidate estimators T 
and a penalty function satisfying Kraft's inequality, such that 
the bound 

k k log2 m 



ER(f*f)<0(N) min 



-2a 



I 



Ol'i^^L (17) 



N 



where 



k^{N) 



N 



2c4C;Jlog2m' 

holds with the same probability as in Theorem^ The constants 
obscured by the O(-) notation depend only on p, p, C and c. 

The proof is presented in Appendix [Cj here we highlight a 
number of impHcations: 

1) In the low -intensity setting / < mlogm, we get the risk 
bound 

2k log2 m 



ER(f*f)<0(N) min 



O 



-2a 



I 



\og{m/N)\ 



N 



If k^,{N) > (a// log2 m)^/(^"+^\ then we can further obtain 



ER{f*,f)<0{N) 



log m 



O 



logim/N) 
N 



If k^{N) < (a// log2 there are not enough 

measurements, and the estimator saturates, although its risk 
can be controlled. 

2) In the high-intensity setting I > ni log m, we obtain 

2fc1 



ER(f*J)<0(N) min 



O 



m 

log(r7i/iV) 
N 



If k^{N) > (Q;m)i/(2"+i\ then we can further get 

'log(m/A^) 



Ei?(/*, /) < OiN)m-^ + O 



N 



Again, if k^,{N) < (am)^/(^"+^-', there are not enough 
measurements, and the estimator saturates. 
3) When I x m and N x m^^^ for some /3 > 1 + 1/2q;, we 
get (up to log terms) the rates 



ER{f* J) = O {m--^) , 



where 7 



2a-(2a+l)//9 
2a+l 



> 0. 

IV. Experimental Results 



In this section we present the results of a proof-of-concept 
experiment showing the effectiveness of sparsity-regularized 
Poisson log likelihood reconstruction from CS measurements. 
In previous work p], we described an optimization formula- 
tion called SPIRAL (Sparse Poisson Intensity Reconstruction 
Algorithms) for solving a simpler variant of ([3]): 

/ = arg min + r pen(/)] subject to / ^ 0, (18) 

/ 



where <j>{f) = J2ji^f)j ^ Vj log(^/)i- '^^^ setup, since A 
has nonnegative entries, the constraint Af ^ in (j3]l is redun- 
dant. Additionally, we do not require that the total intensity of 
the reconstruction / must sum to / since the resulting problem 
is easier to solve, and this equality constraint, in general, is 
approximately satisfied at the solution, i.e., it is not necessary 
to obtain accurate experimental results. 

The proposed approach sequentially approximates the ob- 



jective function in ( 18 1 by separable quadratic problems (QP) 



that are easier to minimize. In particular, at the fc-th iteration 
we use the second-order Taylor expansion of </> around and 
approximate the Hessian by a positive scalar (77^) multiple of 
the identity matrix, resulting in the following minimization 
problem: 



/ 



fe+i 



arg mm / 
/ 

subject to / ^ 0, 



f 



2t 

Vk 



pen(/) 
(19) 



which can be viewed as a denoising subproblem applied to the 
gradient descent. This gives us considerable flexibility in de- 
signing effective penalty functions and in particular allows us 
to incorporate sophisticated "sparsity models" which encode, 
for instance, persistence of significant wavelet coefficients 
across scales to improve reconstruction performance. In the 
experiments below we utilize one such penalty, a partition- 
based estimator, as described in |10|. 

Partition-based methods calculate image estimates by de- 
termining the ideal partition of the domain of observations 
and by using maximum likelihood estimation to fit a model 
(e.g., a constant) to each cell in the optimal partition. The 
space of possible partitions is a nested hierarchy defined 
through a recursive dyadic partition (RDP) of the image 
domain, and the optimal partition is selected by pruning a 
quad-tree representation of the observed data to best fit the 
observations with minimal complexity. We call this partition- 
based algorithm SPIRAL-RDP. An additional averaging-over- 
shifts (cycle spinning) step can be efficiently incorporated to 
yield a translationally-invariant (TI) algorithm, which we call 
SPIRAL-RDP-TI, that results in more accurate reconstruc- 
tions. 

The main computational costs of the SPIRAL methods come 
from matrix-vector multiplications involving A for calculating 
rjk and V4){xk) in (19i. Specifically, at each iteration fc. 



SPIRAL computes two matrix-vector multiplications each with 
A and with A^ . For SPIRAL-RDP and SPIRAL-RDP-TI, 
even though the partition-based penalty QP appears difficult 
to solve because of its nonconvexity due to the penalty term, 
its global minimizer is easily computed using a non-iterative 
tree-pruning algorithm (see |j4J and |10J for details). 

We evaluate the effectiveness of the proposed approaches 
in reconstructing a piecewise smooth function from noisy 
compressive measurements. In our simulations, the true signal 
(the black line in Figs. [TJa) and[TJb)) is of length 1024 and 
total intensity / = 8.2 x 10^. We take 512 noisy compressive 
measurements of the signal using a sensing matrix that con- 
tains 32 randomly distributed nonzero elements per row. This 
setup yields a mean detector photon count of 50, ranging from 
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Fig. 1. Reconstruction results using the SPIRAL algorithm, (a) SPIRAL-RDP yields a nonsmooth piecewise constant estimate /rdp with a corresponding 
risk i?(/RDPi/*) = 7.552 X 10~^. (b) SPIRAL-RDP-TI produces a smoother and more accurate reconstruction /rdp-ti with corresponding risk 
^{/rdp-TIi /*) = 4.468 X 10^^. The plot (c) shows the reconstruction accuracy versus number of measurements for two particular values of the intensity 
I (10^ and 10^). 



as few as 22 photons, to as high as 94 photons. We allowed 
each algorithm a fixed time budget of three seconds in which 
to run, which is sufficient to yield approximate convergence 
for all methods considered. Each algorithm was initialized at 
the same starting point: if z = A'^y, and x : Xi — yi/{Az)i, 
then we initialize with : = Zi{A^ x)i/{A^\)i, where 1 
is a vector of ones. The value of the regularization parameter 
T was tuned independently for each algorithm to yield the 
minimal risk R{f,f*) = \\f — f*\\\/l'^ at the exhaustion of 
the computation budget. Tuning the regularization parameter 
in this manner is convenient in a simulation study. However, in 
the absence of truth, one typically resorts to a cross-validation 
procedure to determine an appropriate level of regularization. 

In Fig. [ija), we see that models within a partition (constant 
pieces) are less smooth than that of the original signal; 
however this drawback can be effectively neutralized through 
cycle spinning (see Fig. [TJb)). In addition, the accuracy of the 
reconstruction (measured using the risk /*)) is improved 
by this averaging of shifts. Specifically, the SPIRAL-RDP es- 
timate /rdp has a risk of i?(/RDP, f *) — 7.552 x 10^^, while 
the SPIRAL-RDP-TI estimate /rdp~ti achieves a much 
lower risk of i?(/RDP-Ti, /*) = 4.468 x 10"^ In Fig.jljc), we 
examine how the performance of both partition-based SPIRAL 
methods scale with the number of measurements. These results 
utilize the same true signal and sensing matrix type as before, 
and are averaged over four noise realizations. By choosing 
two different intensity levels, we see that a higher intensity 
consistently leads to better performance. However, for a fixed 
intensity, the benefits of a higher number of measurements are 
less pronounced since collecting more observations necessarily 
results in a lower intensity per observation and hence noisier 
measurements (i.e., less photons collected per measurement). 
Note that the plot in Fig. [TJc) is not smoothly decreasing as 
one would expect; as we change the number of measurements, 
we need to randomly generate new Poisson realizations of our 
data, and hence there is some degree of variability in these 
results. 

In |4j, we examine our SPIRAL approach with an £i- 
norm penalty on the coefficients of a wavelet expansion 
of the signal. In this case, the resulting reconstruction is 



very oscillatory with pronounced wavelet artifacts. With an 
increase in the regularization parameter these artifacts can be 
minimized; however, this leads to an "oversmoothed" solution 
and increases the risk of the estimate. In addition, we compare 
the SPIRAL approaches to the more established expectation- 
maximization algorithms based upon the same maximum 
penalized likelihood estimation objective function found in 
([19]) and demonstrate that reconstructions from the partition- 
based SPIRAL methods are more accurate. In particular, they 
produce estimates with lower risk values, are more effective in 
recovering regions of low intensity, and yield reconstructions 
without spurious wavelet artifacts. 

We mention other recent approaches for solving Poisson 
inverse problems; given that a detailed comparison of the per- 
formances of these various methods is beyond the scope of this 
paper, we simply outline some potential drawbacks with these 
approaches. In p9) , the Poisson statistical model is bypassed 
in favor of an additive Gaussian noise model through the use of 
the Anscombe variance-stabilizing transform. This statistical 
simplification is not without cost, as the linear projections of 
the scene must now be characterized as nonlinear observations. 
Other recent efforts Q, pOJ solve Poisson image reconstruc- 
tion problems with total variation norm regularization, but the 
method involves a matrix inverse operation, which can be 
extremely difficult to compute for large problems outside of 



deconvolution settings. Finally, the approaches in |31|, |32| 
apply proximal functions to solve more general constrained 
convex minimization problems. These methods use projection 
to obtain feasible iterates (i.e., nonnegative intensity values), 
which may be difficult for recovering signals that are sparse 
in a noncanonical basis. 

V. Conclusions 

We have derived upper bounds on the compressed sensing 
estimation error under Poisson noise for sparse or compress- 
ible signals. We specifically prove error decay rates for the 
case where the penalty term is proportional to the ^o-norm 
of the solution; this form of penalty has been used effec- 
tively in practice with a computationally efficient Expectation- 
Maximization algorithm (cf. j22|), but was lacking the theoret- 
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ical support provided by this paper. Furthermore, the main the- 
oretical result of this paper holds for any penalization scheme 
satisfying the Kraft inequality, and hence can be used to 
assess the performance of a variety of potential reconstruction 
strategies besides sparsity-promoting reconstructions. 

One significant aspect of the bounds derived in this paper 
is that their signal-dependent portion grows with N, the size 
of the measurement array, which is a major departure from 
similar bounds in the Gaussian or bounded-noise settings. It 
does not appear that this is a simple artifact of our analysis. 
Rather, this behavior can be intuitively understood to reflect 
that elements of y will all have similar values at low light 
levels, making it very difficult to infer the relatively small 
variations in Af*. Hence, compressed sensing using shifted 
Rademacher sensing matrices is fundamentally difficult when 
the data are Poisson observations. 

Appendix A 
Proof of Theorem[T] 

The proof makes heavy use of the geometric approach of 
|[l9l, 1*201. Since this approach is not as well-known in the 
compressed sensing community as the usual RIP, we give a 
brief exposition of its main tenets. Consider the problem of 
recovering an unknown signal /*, which resides in some set 
A C M™, from N linear measurements of the form yi = 
{Zi, /*), . . . ,yN ~ {Zn, f*), where the measurement vectors 
Zi,...,Zn E M™ are i.i.d. samples from a distribution /i 
which is: 

• subgaussian with constant Q, i.e., there exists a constant 
C > 0, such that for Zq ^ fj, and for every u £ M™, 

<2^<CI|-||2 (20) 



inf I s > : E exp 

isotropic, i.e., for Zq 

E\{Zo,u)\^ = \\u\\l 



fj, and for every u G 



The main results of 1 19 1 revolve around the norm preservation 
properties of the random operator A ; M™ defined by 



1 



Au = ^{Zi,u)e„ 



(21) 



where ei, . . . , e^r is the standard basis in . Particularized 
to the case of A = B"\ the unit ball w.rt. the £i norm on 
M™, the first main result of | |T9) reads as follows: 

Theorem 4 Theorem A] Let Zi, . . . , Zn £ M™ be i.i.d. 
samples from an ensemble fj, which is isotropic and subgaus- 
sian with constant C > I. There exist absolute constants 
Ci,C2 > 0, such that, with probability at least 1 — e^^^^/^ , 



\\u~v\\l <4\\A{u-v) 
for all u,v e B™. 



N 



(22) 



The other main result of |[T9), informally, states the following: 
for any finite set T C S"^^, the random operator A does not 
distort the norms of the elements of T too much, provided 
the number of measurements N is sufficiently large. The 



required minimal number of measurements is determined by 
the cardinality of T. In its precise form, the relevant result of 



1 19 1 says the following: 



Theorem 5 ^19\ Corollary 2.7] There exist absolute con- 
stants €3,04 > 0, such that the following holds. Consider 
a finite set T C 5™^^, and let Zi, . . . , Zjv G be i.i.d. 
samples from a Q- subgaussian isotropic ensemble. Then, with 
probability at least 1 — e~'^^^^^ , 



1 ^ 

<\\Au\\l^-Y,\{Z,,u)\ 



provided N > C4C,* log2 |A|. 



2 3 



Vm e A (23) 



Remark 4 Theorem |5] is a special case of a more general 
result that applies to general (not necessarily finite) subsets 
T of the unit sphere. The minimum necessary number of 
measurements is determined by the so-called -functional of 
T, which is defined as follows. Let gi, . . . , g,„ be independent 
standard Gaussian random variables, i.e., each gi ^ N{0, 1) 
independently of all others. Then 



E sup 



i=l 



where the expectation is taken w.rt. 51, ... , g„, and Ui denotes 
the ith component of u. Informally, £^, (T) measures how much 
the elements of T are correlated with white Gaussian noise. 
Estimates of £^ (T) are available for many sets T. For instance 
(see Section 3 of |19| and references therein): 

. If T is a finite set, then £^{T) < c^/log2 \T\ for some 

absolute constant c > 0. 
• If T is the set of all convex combinations of unit vectors 

in M™ whose £0 norm is at most k, 

T = convhufl {u e S"'-^ : \\u\\o = \{i : Ui ^ 0}| < k} , 

(24) 

then £*(r) < c-\/ k logj (cm/fc) for some absolute con- 
stant c > 0. (We do not use this particular T in our 
analysis, but mention it here because of its connection to 
the more widely known RIP p2).) 
The minimum necessary number of measurements for (j9]l to 
hold with the same probability as before is determined by 
N > C4C''^*(T)2. When |r| is finite, combining this bound 
with the estimate of in terms of the log-cardinality of 

T yields Theorem |5] Moreover, as shown in |20|, the usual 
RIP for matrices with rows drawn from subgaussian isotropic 
ensembles is a consequence of this result as well. Specifically, 



it relies on the £^{T) estimate for the set T defined in (24i. 



We now apply Theorems ffland [5] to the measurement 



matrix A defined in Section 

and let Zi = (Z^^i, . . . , Z,^,„ 



II-A 



Recall that A = Z/VN^ 



) denote the ith row of Z. By 
construction, each Zi is an independent copy of a random 
variable Zq G M™ with disti'ibution i/^™ — that is, the 
components of Zq are drawn i.i.d. from lyp. To be able to 
apply Theorems |4] and |5] we first show that the distribution 
t-^™ is subgaussian and isotropic. To that end, we need to 
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obtain a bound of the form ( |20| l for linear functionals of the 
form {Zo,u). The infimum on the left-hand side of Eq. (20i is 
the so-called Orlicz ■ip2-norm of the random variable {Zq,u). 
Given a real- valued random variable U, its Orlicz ijj2-noim 
| |33] Ch. 2] is defined as 

^2 



< 2 



= inf |s > : Ecxp 

Thus, fj, is subgaussian with constant C, if for Zq ^ /i we have 

11(^0, <CII"ll2, VueM". 
Here is a useful tool for bounding Orlicz norms: 

Lemma 1 [33, Lemma 2.2.1 ] Let U be a real-valued random 
variable that satisfies the tail bound 

V[\U\ >t]< Ke-^^^ 

for all t > Q, where K,C > are some constants. Then its 
Orlicz 'ip2-norm satisfies ||?7||^)2 < •\/(l + K)/C. 

Using this lemma, we can prove the following: 

Lemma 2 The product probability measure is isotropic 
and subgaussian with constant C,p defined in 



Proof: Let Zq = (Zo,i, 



,0Tn 



. Isotropy 



is immediate from To prove subgaussianity, let us first 
assume that p ^ 1/2. Fix some u E and consider 
the random variable {Zq,u), which is a sum of independent 
random variables ZojUj, 1 < j < m. Then E{Zo,u) = 0, 
and with probability one each ZojUj takes values in the set 
{-X-\ujl-X+\uj\} if < 0, and in {X-\uj\, X+\uj\} if 
Uj > 0. Hence, Hoeffding's inequality gives the tail bound 



2^2 



(Ap - Xp )2 

2p{l-p)t^' 



P[\{Zo,u) \ >t]< 2exp 



2 exp 



Using Lemma [T| with K = 2 and C = 2p(l — p)/||u||2, we 
get the desired result. When p = 1/2, using the fact that the 
Rademacher measure is symmetric, it can be shown that i/fj^ 
is subgaussian with constant C 1 1,19]. ■ 
Now, using Theorems |4|and|5] in conjunction with Lemma|2] 
we have proved TheoremTT] 

Appendix B 
Proof of Theorem|2] 

With probability at least 1 — e^'^^^^'^p, the following chain 
of estimates holds: 

4 



< j^\m*~fm + 



2c2C4l0g(c2C>/iV) 



N 

||^(^._^^^,|,^2c|C,Mog(c.C>/A^) 



< 



pil-p)P 



\\Mr~f)\\i + 



N 

2 2c2c4log(c2Cp*m/7V) 



N 



where the first inequality is a consequence of the first part of 
Theorem [T] and the remaining steps follow from definitions, 
standard inequalities for ^ norms, and the fact that ^ • fi — 
J2if! for all / e r Q Moreover, 



ii^(r-/)ii? 



N 



a/2 



(Af*)'/' + (Af) 



1/2 



N 



< E W*)' -W)' {Af*)f + iAf) 



.1/2 



N 



< \iAfnr~iAf) 



,1/2 



N 



{Af*), + {Af), 



< ilJ2\{Af*) 



1/2 



{Af) 



1/2 



where the first inequality follows from Cauchy-Schwarz, the 
second is a consequence of the arithmetic-mean/geometric- 
mean inequality, and the third follows fromj5]l. It is a matter 
of straightforward algebra (see Appendix D-A below) to show 
that 



N 



1/2 



{Af) 



1/2 



—2 log Y[ exp 

i=i 

21og- 



1 r 

2 
1 



iAf*)y'-{Af)y' 



(25) 



JJp{y\Af*)piy\Af)d,.iy) 



where v is the counting measure on Z^. Now, the same 
techniques as in Li and Barron |34| (see also the proof of 



Theorem 7 in p5j or Appendix D-B below) can be used to 
show that 



2E log ■ 



1 



JJp{y\Af*)piy\Af)di.{y) 



< min 

/er 



KL 



{p{-\Af*)\\p{-\Af))+2pcn{f) 



(26) 



where KL(-||-) is the Kullback-Leibler (KL) divergence, which 
for the Poisson likelihood has the form 



KL {p{-\Af*)\\p{-\Af) 

N r 



{Af*)^\og^-^-{Af*). + {Af). 



Using the inequality \ogt < t — 1 together with (12i, which 
holds with probability at least 1 — Np"\ and (j6|l, we can bound 

'^We use the fact that \\A{f* - /)||2 = _ /)||2, which is 

only true when / = = - If we did not assume / was known, we 

would have additional terms in our error bound which would be proportional 
to / and would not reflect our ability to exploit prior knowledge of structure 
or sparsity to achieve an accurate solution. 
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the KL divergence as 

N 



E 



(An, 



{An^ + {Af), 



N r 



< 



i^rU (^-^S^ - 1 ) - {Afh + iAf)r 



1 



< 



E 

4 = 1 

N 

5ii^(r-/)ii2 



V (Afh 
[{Afr,-2{AfUAr 



Cl 



mr-fw 



Now, choose any T* e ^/Ar.p, such that 



R*{f*,T* 



min R*{f*,T). 



Observe that N > C4Cp log |T*|, so we can apply the second 
part of Theorem jlj to the set | \\flfy : / e T* |. Then, with 



probability at least 1 



-C3JV/C 



p, we have 



iiA(r-/)ii^<(3/2)iir-/ii: 



V/ e T* 



Combining everything and rearranging, we get the bound 

Ei?(r,/) < 



/ex- 



2pcn(/) 



2c2log(c2C>/A^) 



N 



which holds with probability at least 1 — [Np" 



-C3N/C 



> 1 — me 



-KN 



for a suitable constant K — 



K{ci,C3,p). The theorem is proved. 

Appendix C 
Proof of Theorem[3] 

In order to apply Theorem |2j we will form a suitable finite 
class of estimators F and set a penalty function pen(/) over 
this class which (a) is smaller for sparser 6 = W'^ f and (b) 
satisfies ( [TT] i. The family F is constructed as follows. 

1) Define the sets 

Q^{ee M" : ll^lloo < /; 

each 9i uniformly quantized into one of 
y/m bins over [—/,/]} 

andT={f = We:de Q}. 

2) For each f G J^, let / denote its £2 projection onto the 
closed convex set 

LeM™ (c/)l„xi and ^/l , 



I.e., 



/ = argmin ||/ - g\\2. 

gee 



3) Finally, let T = {6 ^ W'^ f : f eT}. 
Note that the projection / satisfies the Pythagorean identity 

||ff-/||^>||g-/l^ + ||/-/||i V,geC 



(see, e.g.. Theorem 2.4.1 in p6)). In particular, \\g — /jH > 
\\g — /II2, and, since /* e C, we have 



iir-/l2<iir 



fWl 



V/e 



(27) 



Consider the penalty 

pen(/) = log2(m + 1) + (3/2)||0||o log^lm), 9 = f 
(measured in bits; in order to satisfy Kraft's inequality as stated 



in ( 1 1 1, it will need to be rescaled by log 2). This corresponds 
to the following prefix code for 6* G O (that is, we encode the 
elements of 8, before they are subjected to the deterministic 
operation of projecting onto C): 

1) First we encode ||6'||o, the number of nonzero compo- 
nents of 6, which can be encoded with log2(m + l) bits. 

2) For each of the 1 161110 nonzero components, we encode 
its location in the 6 vector; since there are m possible 
locations, this takes log2(TO) bits per component. 

3) Next we encode each coefficient value, quantized to one 
of -^m uniformly sized bins. 

Since this corresponds to a uniquely decodable code for f E J- 
(or 9 € 6), we see that pen(/) indeed satisfies the fCraft 
inequality (37). 

= W'^f*, let 9^''') be its best /c-term 
G 8 the quantized version of 9'-''\ for 



Now, given 
approximation, 
which we have 



n(fe) 



^1 



and 

W9. 



~(k) (k) 

3q the element of F obtained by projecting fq ~ 



(fe) 



q onto C and then transforming back into the basis $: 



iir-/ri 



q 

Xk)\ 



Then, using (|27|i and (|16|l, we get 



I < 



mi 



< 2\\9* 



< 



2||e'('^') 

2k 



Given each 1 < fc < m, let F^ C F be the set of all 9 e T, 
such that the corresponding 9 E Q satisfies ||^?||o < fc. Then 
|Ffe| (")m'=/2, so that logj |Ffc| < 2fclog2r7i, and therefore 
Tfc G Gn,p whenever 



fc < k*{N), where fc*(7V) 



N 



2c4Cp log2 m 



Then the first term on the right-hand side of (13i can be 
bounded by 



Cn min i?*(r,Ffc) 

l<k<k,{N) 



< 0(N) min 

l<fe<fc,(A') 



< 0(N) min 

l<fe<fc.(A') 



^1 



2pen(/('=)) 
/ 



-2a 



fc log2 m 



where the constant obscured by the 0( ) notation depends only 
on p, p, C and c. 
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Appendix D 
Auxiliary technical results 



A. Proof of ^\ 

Given two Poisson intensity vectors g,h G E.^, we have 

\/piy\9)piy\h)di^{y) 

= n / ^P^Vi\9i)p{yi\hi)dv^{yi) 

i=i 

N CO , , w II 



i=\ yi=0 

N 

-{a^ 



n- 

i=l 
N 

n 

i=l 

N 



^ y^\ 



e HM'^'-c^-y^'f I p{y.\i9.h.yndMyO 



i=l 



where Vi denotes the counting measure on the ith component 
of y. Taking logs, we obtain 



2 log 



/ \/p{y\9)p{y\h)dy{y) 



N 

Y.{!^9^Y"~{hY 



/2 



The quantity on the left-hand side is often used to measure 
divergence between probability distributions, and dates back 
to the work of Bhattacharyya fSFl and Chernoff p9). 



B. Proof of (26) 



For the sake of brevity, we will write Pf{y) and Pf{y) 
instead of p{y\Af*) and p{y\Af). Also, define the Hellinger 
affinity 

A{f*J)^ Upr{y)Pf{y)dv{y). 



We then have 
2 log ^ 



A{f\f) 



2 log 



log^^^+2pen(7). 



In the first term on the right-hand side, the ratio is evaluated 
at /. Replacing this ratio by the sum of such ratios evaluated 
at every / e F, we obtain the upper bound 



2 log 5] 



^Pf[y)/Pr{y)e-^-^(f^ 



+logPZl^+2pen(/). 

Pfiy) 



Now we take expectation w.r.t. pf*{y). Then, by Jensen's 
inequality. 



E;. <jlog^ 



VpMJPf^e-P<^-<-f^ 



A{f*J) 



g- pcn(/) 

Aif*J) 



Et 



f \ \Jpf{v)/pf {y) 



=AU'J) 



= log^e-P™(^) < 0. 
/er 

By definition of /, we have 

Pfiy) 



log 



Thus, 



Pf{y) 



2 pen(/) < min 



/er 



log^^ + 2pen(/) 

Pfyy) 



log^+2pen(7) 



Pfiy) 



< mill 

/er 



Ej, log^^^ + 2pen(/) 

Pfiy) 



= min [KL(p/. + 2pcn(/)] . 

Putting everything together, we get the bound 
1 

AiKT) 

KL{p{-\Af*)\\pi-\Af))+2pen{f) 



2Elog - 



< min 
/er 



as advertised. 
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